Abstract: We study the behavior of the posterior distribution in ultra
high-dimensional Bayesian Gaussian linear regression models having $p > n$,
with $p$ the number of predictors and $n$ the sample size. In particular, our
focus is on obtaining non-asymptotic probabilistic bounds on the posterior
probability assigned in neighborhoods of the true regression coefficient
vector, $\beta^0$, with these bounds used to study contraction of the posterior.
We assume that $\beta^0$ is approximately $S$-sparse and obtain universal bounds
via a Schwartz-type argument, though only well-structured priors exhibit good
properties. Based upon these finite sample bounds, we examine the implied
asymptotic contraction rates for several examples, showing that
sparsely-structured and heavy-tail shrinkage priors exhibit rapid contraction
rates. Using brute force, we also demonstrate that a stronger result holds for
the Uniform-Gaussian prior, which indicates that our main result can be
strengthened and reinforces the fact that the estimates of the denominator
in the Schwartz-type arguments are not sharp in the finite sample regime.

1. Introduction

Consider the Gaussian linear regression model 

$$y_i = x_i^T\beta^0 + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0,\sigma^2), \quad i = 1,\dots,n, \tag{1.1}$$

where $x_i$ and $\beta^0$ are $p$-dimensional vectors. In modern applications, it has
become common to collect ultra high-dimensional data in which the number of
subjects $n$ is much smaller than the number of predictors $p$. Assuming a prior
distribution $\Pi$ for the unknown coefficient vector $\beta$, our focus is on
studying the concentration of the posterior distribution for $\beta$ in
neighborhoods of $\beta^0$ under the simplifying assumption that (1.1) is the true
data-generating model. Although there is an increasingly rich literature
studying the behavior of frequentist variable selection and point estimation in
ultra high-dimensional settings [5, 13, 14], there has been essentially no work
characterizing concentration of posterior distributions for $\beta$ when $p > n$.
By placing an appropriately-structured prior on $\beta$, one can potentially
bypass the troubling issue of tuning parameter sensitivity while obtaining a
useful probabilistic characterization of uncertainty. For example, the
posterior can be used to construct credible regions for $\beta$.

In finite high-dimensional settings, the usual asymptotic justifications break
down and it is important to carefully study frequentist properties of the
posterior. There has been some relevant work in the literature. In [8], Ghosal
obtained a Bernstein-von Mises theorem providing sufficient conditions to
obtain asymptotic normality of the posterior distribution for $\beta$ under model
(1.1) allowing non-Gaussian residuals, but he requires $p$ to grow much slower
than $n$. Jiang [10] studied rates of convergence of the predictive distribution
obtained using Bayesian variable selection within a generalized linear model
having a diverging number of candidate predictors, but his results focus only
on the predictive posterior of $y$ given $x$ and not on the posterior of $\beta$.
More recently, Bontemps [2] obtained a Bernstein-von Mises theorem for a class
of semiparametric and nonparametric Gaussian regression models. For the model
(1.1) compared with [8], his results allow a faster growth rate of $p < n$.
However, addressing our interest in $p \gg n$ requires new theory; in this much
more challenging case, we do not attempt a Bernstein-von Mises result but
instead provide explicit finite sample probabilistic bounds on the posterior
probability assigned to neighborhoods of $\beta^0$.

When $p \gg n$ there clearly needs to be some sort of dimensionality reduction
or prior information included to make the problem tractable. One common
flavor of dimensionality reduction corresponds to sparsity, which manifests in
model (1.1) through assuming that only a small number of elements of the true
coefficient vector $\beta^0$ are non-zero. Sparsity is a particularly natural
assumption that appears implicitly in the literature of model selection through
thresholding and shrinkage. Such model selection was originally performed in
the $p < n$ case with the goal of surmounting overfitting and reducing the
prediction error of a model [1]. The philosophy behind the more recent leap to
$n \ll p$ motivated Donoho's rhetorical question: "Why go to so much effort to
acquire all the data when most of what we get will be thrown away?" This
philosophy gave birth to the field of compressed sensing [4, 7], which studies
reconstruction and estimation of approximately sparse signals when $n \ll p$.
Much of the rich literature on $n \ll p$ problems outside of the statistical
community occurs within the compressed sensing community.

1.1. Contributions

In this paper, we provide theoretical validation of the Bayesian approach when
$n \ll p$. Our main result, Theorem 3.1, employs a modification of Schwartz's
argument (the weapon of choice for Bayesian asymptotics) to exhibit an explicit
bound on the expected concentration of a posterior for an arbitrary prior. The
utility of this bound is evident when

(i) the probability the prior assigns to a small ball around the true $\beta^0$
is not too small;

(ii) the probability the prior assigns to signals that are not sparse (or
approximately sparse) is very small.

These observations are embodied in Theorem 3.2, which summarizes conditions
on a prior $\Pi$ which ensure asymptotic posterior contraction and associated
rates of contraction. To demonstrate the power of this result, we provide a
number of worked examples including the Uniform-Gaussian, the
Bernoulli-Gaussian, and Laplace priors. For these examples, bounds on the
asymptotic rates of posterior contraction are derived.

We conclude by demonstrating that Theorem 3.1 is not sharp, and we discuss 
why this is the case. In particular, as we derived our main theorem using a 
Schwartz-type analysis, Theorem 4.1 indicates that the denominator estimate 
in the Schwartz-type analysis is too severe in the non-asymptotic setting. 

1.2. Organization

In Section 2, we fix notation and provide background results that shall be em- 
ployed throughout the paper. Section 3 introduces our main result, the explicit 
bound on expected posterior concentration for an arbitrary prior and a fixed 
problem size. We discuss the role of each term in this bound to indicate ex- 
actly when the bound is useful, and then provide a general theorem concerning 
asymptotic posterior contraction. Lastly, we apply our main result to calculate 
posterior contraction rates for some example priors. In Section 4, we discuss the 
sharpness of our main result and avenues for future exploration. In particular, we 
exhibit a theorem which provides a sharp guarantee for the Uniform-Gaussian 
prior. Appendices A and B contain the supporting technical material for Sec- 
tions 3 and 4. 

2. Preliminaries 
2.1. Notation 

Assuming model (1.1) is the true model with $\beta$ unknown, we focus on the
posterior for observed data

$$y = X\beta^0 + \epsilon, \tag{2.1}$$

where $y$ is the $n$-dimensional response, $X$ is the $n \times p$ design matrix,
$\Pi(\beta)$ is the prior on $\beta$, and $\epsilon \sim \mathcal{N}(0,\sigma^2 I_n)$. For a
fixed problem $(S, n, p, X, \beta^0)$, we designate the following assumptions:

(A1) the $i$th column of $X$ satisfies $\|X_i\|_{\ell_2}^2 = n$ for all $i = 1,\dots,p$
(A2) $\beta^0$ is $S$-sparse ($\|\beta^0\|_{\ell_0} \le S$)
(A3) $\sigma$ is known

Whenever we consider a sequence of problems $(S_n, n, p_n, X(n), \beta^0(n))$ with
$n \to \infty$, we additionally assume that $|\beta^0_i(n)| \le C < \infty$ for all $i$ and
$n$, and that the constants $C$ and $\sigma$ remain fixed as $n$ increases. We shall
let $X^T$ denote the transpose of the matrix $X$, and we let $B^{\ell_u}_\epsilon(\beta)$
denote the $\ell_u$ ball of radius $\epsilon$ centered at $\beta$.

The first essential condition for the success of $n \ll p$ analysis is that
$\beta^0$ is $S$-sparse or $S$-compressible. To formalize these concepts, we recall
the best $k$-term approximations and define the $(S,R)$-compressible vectors.

Definition 2.1. For any $\beta \in \mathbb{R}^p$ and any natural number $S < p$, let
$\sigma_S(\beta)$ denote the best $S$-term approximation error of $\beta$ so that

$$\sigma_S(\beta) = \inf_{\|z\|_{\ell_0} \le S} \|\beta - z\|_{\ell_1}. \tag{2.2}$$

Furthermore, for any $R \ge 0$, let

$$\mathcal{P}_{S,R} = \{\beta \in \mathbb{R}^p : \sigma_S(\beta) \le R\} \tag{2.3}$$

denote the set of $(S,R)$-compressible vectors.

Note that $\mathcal{P}_{S,0}$ is exactly the union of canonical $S$-dimensional
subspaces in $\mathbb{R}^p$. When $S$ and $R$ are clear from the context, we shall
simply let $\mathcal{P} = \mathcal{P}_{S,R}$.
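
As a small illustration of Definition 2.1, the infimum in (2.2) is attained by keeping the $S$ largest-magnitude entries of $\beta$, so the error is the $\ell_1$ mass of the remaining entries. The sketch below assumes that reading of (2.2); the function names `best_s_term_error` and `is_compressible` are hypothetical and not part of the paper.

```python
def best_s_term_error(beta, s):
    """l1 error of the best s-term approximation of beta, as in (2.2)."""
    if not 0 <= s <= len(beta):
        raise ValueError("s must lie between 0 and len(beta)")
    magnitudes = sorted((abs(b) for b in beta), reverse=True)
    # Keeping the s largest magnitudes is optimal; the error is the l1 mass of the rest.
    return sum(magnitudes[s:])


def is_compressible(beta, s, r):
    """Membership test for the set of (S, R)-compressible vectors in (2.3)."""
    return best_s_term_error(beta, s) <= r


beta = [3.0, -0.1, 0.0, 2.0, 0.05]
print(best_s_term_error(beta, 2))     # l1 mass outside the two largest entries
print(is_compressible(beta, 2, 0.2))  # membership in P_{2, 0.2}
```

A vector with at most $S$ nonzero entries has error zero, which matches the remark that $\mathcal{P}_{S,0}$ is the union of the canonical $S$-dimensional subspaces.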

Given the model (2.1), we let $f(y\mid\beta)$ denote the likelihood of
$\beta \in \mathbb{R}^p$ given observed data $y \in \mathbb{R}^n$, and hence

$$f(y\mid\beta) = (2\pi\sigma^2)^{-n/2}\exp\{-\|y - X\beta\|_{\ell_2}^2/2\sigma^2\}. \tag{2.4}$$

For any $\beta \in \mathbb{R}^p$, Borel measurable $U \subset \mathbb{R}^n$, and Borel
measurable function $F : \mathbb{R}^n \to \mathbb{R}$, we let

$$\mathrm{pr}_\beta(U) \quad\text{and}\quad \mathbb{E}_\beta F \tag{2.5}$$

denote the probability of the event $y \in U$ given the parameter $\beta$ and the
expectation of $F(y)$ given the parameter $\beta$, respectively. We also let $U^c$
denote the complement of the set $U$, and we let $1_U$ denote the indicator
function of $U$. Finally, for a linear operator $X : \mathbb{R}^p \to \mathbb{R}^n$, we
use the notation $\|X\|_{\ell_u\to\ell_v}$ to denote the operator norm

$$\|X\|_{\ell_u\to\ell_v} = \max_{\{\beta : \|\beta\|_{\ell_u} = 1\}} \|X\beta\|_{\ell_v}. \tag{2.6}$$
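
For the pair $u = 1$, $v = \infty$ used in the examples later on, the operator norm in (2.6) reduces to the largest absolute entry of the matrix, because the maximum of a convex function over the $\ell_1$ ball is attained at a signed coordinate vector. The sketch below computes this for the Gram matrix $X^TX$ of a toy design; the matrix and the helper names are illustrative assumptions only.

```python
def op_norm_l1_to_linf(matrix):
    """max over ||b||_1 = 1 of ||A b||_inf, which equals the largest |A_ij|."""
    return max(abs(entry) for row in matrix for entry in row)


def gram(x):
    """X^T X for a matrix stored as a list of rows."""
    n, p = len(x), len(x[0])
    return [[sum(x[k][i] * x[k][j] for k in range(n)) for j in range(p)]
            for i in range(p)]


# Toy 2 x 3 design whose columns each have squared l2 norm n = 2, as in (A1).
X = [[1.0, 1.0, 2.0 ** 0.5],
     [1.0, -1.0, 0.0]]
print(op_norm_l1_to_linf(gram(X)))  # close to n = 2, up to floating-point rounding
```

Under (A1) the diagonal of $X^TX$ equals $n$ and, by Cauchy-Schwarz, no off-diagonal entry exceeds $n$ in magnitude, so this norm equals $n$.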
2.2. The Dantzig selector

The Dantzig selector is an essential ingredient for our proofs. The properties of
the Dantzig selector depend heavily upon the design matrix $X$, and one simple
assumption laid out by Candes and Tao [5] is that the column norms of $X$ all
equal one. In our situation, it is convenient to rescale $X$ so the column norms
are all $\sqrt{n}$. We shall use $\tilde X = \frac{1}{\sqrt{n}}X$ as an intermediate
quantity to translate the results of Candes and Tao to our setting.

Definition 2.2. For a response vector $y \in \mathbb{R}^n$ and a design matrix
$\tilde X$, the Dantzig selector is the solution to the program

$$\min_{\tilde\beta \in \mathbb{R}^p} \|\tilde\beta\|_{\ell_1} \quad\text{subject to}\quad \|\tilde X^T(y - \tilde X\tilde\beta)\|_{\ell_\infty} \le \lambda_p\sigma, \tag{2.7}$$

where $\lambda_p = \sqrt{2(1+a)\log p}$. The role of the free parameter $a > 0$ is made
apparent in Theorem 2.1. We let $\hat{\tilde\beta}$ denote the solution to this linear
programming problem, and set $\hat\beta = \hat{\tilde\beta}/\sqrt{n}$.
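
To make the constraint in (2.7) concrete, the following sketch checks whether a candidate coefficient vector is feasible for the Dantzig program. It does not solve the linear program; it only evaluates the feasibility condition under the assumption that $\lambda_p = \sqrt{2(1+a)\log p}$. The function names and the toy data are hypothetical.

```python
import math


def dantzig_lambda(p, a):
    """lambda_p = sqrt(2 (1 + a) log p), the threshold used in (2.7)."""
    return math.sqrt(2.0 * (1.0 + a) * math.log(p))


def is_dantzig_feasible(x_tilde, y, beta, sigma, a):
    """Check ||X~^T (y - X~ beta)||_inf <= lambda_p * sigma for a unit-norm-column X~."""
    n, p = len(x_tilde), len(x_tilde[0])
    residual = [y[k] - sum(x_tilde[k][j] * beta[j] for j in range(p))
                for k in range(n)]
    correlations = [sum(x_tilde[k][j] * residual[k] for k in range(n))
                    for j in range(p)]
    return max(abs(c) for c in correlations) <= dantzig_lambda(p, a) * sigma


# Toy design with unit-norm columns and a noiseless response generated by (1, 0, 0).
X_tilde = [[1.0, 0.0, 0.6],
           [0.0, 1.0, 0.8]]
y = [1.0, 0.0]
print(is_dantzig_feasible(X_tilde, y, [1.0, 0.0, 0.0], sigma=0.1, a=1.0))
print(is_dantzig_feasible(X_tilde, y, [0.0, 0.0, 0.0], sigma=0.1, a=1.0))
```

The Dantzig selector is then the feasible vector of smallest $\ell_1$ norm, which in practice would be found with a linear programming solver.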

In order to ensure reconstruction properties for the Dantzig selector for all
$\beta$ of a sufficient sparsity, we must put conditions on $\tilde X$. The first
quantity of interest is the restricted isometry constant, which is the smallest
constant $\delta_k(\tilde X)$ satisfying

$$(1-\delta_k)\|b\|_{\ell_2}^2 \le \|\tilde Xb\|_{\ell_2}^2 \le (1+\delta_k)\|b\|_{\ell_2}^2 \tag{2.8}$$

for all $b \in \mathcal{P}_{k,0}$. Ideally, the constant $\delta_k$ is small enough to
ensure that sufficiently sparse $b$ are far from the kernel of $\tilde X$. The other
quantity of interest is the restricted orthogonality constant
$\theta_{k,k'}(\tilde X)$, which is defined to be the smallest constant such that

$$|\langle \tilde X_T b, \tilde X_{T'} b'\rangle| \le \theta_{k,k'}\|b\|_{\ell_2}\|b'\|_{\ell_2} \tag{2.9}$$

for all $b$, $b'$, and disjoint $T, T' \subset \{1, 2, \dots, p\}$, where $\tilde X_T$,
$\tilde X_{T'}$ are the matrices formed by concatenating the columns of $\tilde X$
with indices in $T$ and $T'$ respectively, $|T| \le k$, $|T'| \le k'$, and
$|T| + |T'| \le p$. Again, the ideal $\theta_{k,k'}$ is small, so disjoint
collections of columns of $\tilde X$ span nearly orthogonal subspaces.

With the restricted isometry and restricted orthogonality constants defined, 
we are now able to translate the theorem of Candes and Tao into our setting. 

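Theorem 2.1 (Candes and Tao '05). Let $S$ be fixed so that
$\delta_{2S}(\tilde X) + \theta_{S,2S}(\tilde X) < 1$ and fix $R \ge 0$. If
$\beta^0 \in \mathcal{P}_{S,R}$, then the rescaled solution to (2.7) satisfies

$$\|\hat\beta - \beta^0\|_{\ell_2} \le \frac{8\sigma}{1-\delta-\theta}\sqrt{\frac{(1+a)S\log p}{n}} + 2\,\frac{1-\delta+\theta}{1-\delta-\theta}\,\frac{R}{\sqrt{S}} \tag{2.10}$$

with probability greater than $1 - 1/(p^a\sqrt{\pi\log p})$, where
$\delta = \delta_{2S}(\tilde X)$ and $\theta = \theta_{S,2S}(\tilde X)$.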



For completeness, we prove this version of the theorem in Appendix A. Since
we shall employ the above condition on $\tilde X$ to invoke this theorem and to
perform further analysis, we add the following assumption to our arsenal.

(A4) $\delta = \delta_{2S}(\tilde X)$ and $\theta = \theta_{S,2S}(\tilde X)$ satisfy $\delta + \theta < 1$

In the case of increasing problem sizes, we shall assume that $\delta$ and $\theta$
remain fixed (or are at least nonincreasing as the problem size increases).
While at first glance this may seem to constrain the applicability of our
theory, such conditions are standard in the theoretical literature on sparse
reconstruction, and obtaining universal statements in problems where $n \ll p$
without similar conditions is an open problem. Another possible concern is that
verification of these constants is combinatorially complex; however, it has
been well established that many families of random matrices satisfy this
condition with high probability when $n \ge CS(\log p)^c$ for some $c \ge 1$. In
particular, matrices whose entries are drawn i.i.d. $\mathcal{N}(0, 1/n)$, and
matrices with sub-Gaussian entries, satisfy this condition with high
probability with $c = 1$. Other matrices that satisfy such a condition (albeit
with $c > 1$) include $n \times p$ matrices whose rows are drawn (uniformly) at
random from orthonormal bases such as the Fourier basis. The interested reader
is referred, for example, to [4].
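
The combinatorial cost of verifying (A4) can be seen already for $k = 2$, where the restricted isometry constant in (2.8) requires inspecting every pair of columns. The sketch below is a brute-force illustration under that assumption, using the closed-form eigenvalues of each $2 \times 2$ Gram matrix; it is not a practical verification method, and the function name is hypothetical.

```python
import math
from itertools import combinations


def restricted_isometry_delta2(x_tilde):
    """Brute-force delta_2 of a matrix (list of rows, at least two columns)."""
    n, p = len(x_tilde), len(x_tilde[0])
    cols = [[x_tilde[k][j] for k in range(n)] for j in range(p)]

    def dot(u, v):
        return sum(s * t for s, t in zip(u, v))

    delta = 0.0
    for i, j in combinations(range(p), 2):
        a, b, c = dot(cols[i], cols[i]), dot(cols[i], cols[j]), dot(cols[j], cols[j])
        # Eigenvalues of the 2 x 2 Gram matrix [[a, b], [b, c]] are mid +/- rad.
        mid, rad = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
        delta = max(delta, mid + rad - 1.0, 1.0 - (mid - rad))
    return delta


X_tilde = [[1.0, 0.0, 0.6],
           [0.0, 1.0, 0.8]]
print(restricted_isometry_delta2(X_tilde))
```

For general $k$ the same enumeration runs over all $\binom{p}{k}$ column subsets, which is why the random-matrix guarantees cited above are used instead.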



3. Main result and examples 



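Theorem 3.1. Suppose $\beta^0 \in \mathcal{P} = \mathcal{P}_{S,R}$ and that $\Pi$ is an
arbitrary prior on $\mathbb{R}^p$. Let
$B = \{\beta \in \mathbb{R}^p : \|\beta - \beta^0\|_{\ell_2} > 2\epsilon\}$, with

$$\epsilon = \frac{8\sigma}{1-\delta-\theta}\sqrt{\frac{(1+a)S\log p}{n}} + 2\,\frac{1-\delta+\theta}{1-\delta-\theta}\,\frac{R}{\sqrt{S}}, \tag{3.1}$$

and assume (A1), (A3) and (A4). For any $a > 0$, $\kappa > 0$, $0 < \nu < a$, and all
$u$ and $v$ satisfying $1/u + 1/v = 1$ with $u \ge 1$,

$$\begin{aligned}
\mathbb{E}_{\beta^0}\Pi(B\mid y) \le{}& \frac{1}{p^a\sqrt{\pi\log p}} && (3.2)\\
&+ \frac{\Pi(B\setminus\mathcal{P})}{\Pi(\mathcal{D}_{\nu,\kappa})\,p^{-\nu}} && (3.3)\\
&+ \frac{1}{\Pi(\mathcal{D}_{\nu,\kappa})\,p^{a-\nu}\sqrt{\pi\log p}} && (3.4)\\
&+ \mathrm{pr}_{\beta^0}(A_\kappa^c), && (3.5)
\end{aligned}$$

where

$$\mathcal{D}_{\nu,\kappa} = \{\beta \in \mathbb{R}^p : \|\beta - \beta^0\|_{\ell_u} < C_{\nu,\kappa}\}, \tag{3.6}$$

$$C_{\nu,\kappa} = \left(\frac{\sqrt{2\|X^TX\|_{\ell_u\to\ell_v}\sigma^2\nu\log p}}{\kappa + \sqrt{\kappa^2 + 2\|X^TX\|_{\ell_u\to\ell_v}\sigma^2\nu\log p}}\right)\sqrt{\frac{2\sigma^2\nu\log p}{\|X^TX\|_{\ell_u\to\ell_v}}}, \tag{3.7}$$

and

$$A_\kappa = \{y \in \mathbb{R}^n : \|X^T(y - X\beta^0)\|_{\ell_v} \le \kappa\}. \tag{3.8}$$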
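
As a rough numerical companion to the theorem, the sketch below evaluates the radius in (3.1), assuming it reads $\epsilon = \frac{8\sigma}{1-\delta-\theta}\sqrt{(1+a)S\log p/n} + 2\frac{1-\delta+\theta}{1-\delta-\theta}\frac{R}{\sqrt S}$. The function name and all numerical inputs are illustrative assumptions.

```python
import math


def contraction_radius(n, p, s, sigma, a, delta, theta, r=0.0):
    """Evaluate epsilon under the assumed reading of (3.1)."""
    if delta + theta >= 1.0:
        raise ValueError("(A4) requires delta + theta < 1")
    gap = 1.0 - delta - theta
    # First term: noise-driven part, shrinking like sqrt(S log p / n).
    noise_term = 8.0 * sigma / gap * math.sqrt((1.0 + a) * s * math.log(p) / n)
    # Second term: penalty for beta^0 being only (S, R)-compressible.
    compressibility_term = 2.0 * (1.0 - delta + theta) / gap * r / math.sqrt(s)
    return noise_term + compressibility_term


eps_small_n = contraction_radius(n=100, p=1000, s=5, sigma=1.0, a=1.0, delta=0.2, theta=0.2)
eps_large_n = contraction_radius(n=400, p=1000, s=5, sigma=1.0, a=1.0, delta=0.2, theta=0.2)
print(eps_small_n, eps_large_n)  # quadrupling n halves the radius when R = 0
```

The set $B$ in the theorem is the complement of the $\ell_2$ ball of radius $2\epsilon$ around $\beta^0$, so a smaller value here corresponds to a tighter neighborhood.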
Ultimately, we want to apply this theorem to obtain concentration bounds
for large classes of sparsity-promoting priors. To do this we must understand the
contribution of each individual term of the inequality stated in Theorem 3.1.
Immediately, we may observe that the terms (3.2) and (3.5) are independent of
$\Pi$. The term in (3.2) comes from using the Dantzig estimator in our hypothesis
test. As such, we have little control over this term aside from adjusting the
$a$ parameter. The fourth term, (3.5), is controlled by the parameter $\kappa$, and
measures the tail probability of a random vector's magnitude. With the simple
choice of $u = 1$ and $v = \infty$, we may set $\kappa = \sqrt{n}\lambda_p\sigma$ and bound this
term by $(p^a\sqrt{\pi\log p})^{-1}$. For all of our examples, we adopt this strategy to
essentially eliminate (3.5) and add a coefficient of 2 to the term (3.2).
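
The bound on (3.5) can be checked numerically. Under (A1) each coordinate of $X^T(y - X\beta^0)$ is $\mathcal{N}(0, n\sigma^2)$, so with $\kappa = \sqrt{n}\lambda_p\sigma$ a union bound over the $p$ coordinates gives $p\cdot\mathrm{pr}(|Z| > \lambda_p)$ for a standard normal $Z$. The sketch below compares that union bound with $(p^a\sqrt{\pi\log p})^{-1}$; it assumes $\lambda_p = \sqrt{2(1+a)\log p}$, and the function names are hypothetical.

```python
import math


def union_tail_bound(p, a):
    """p * pr(|Z| > lambda_p) for standard normal Z, lambda_p = sqrt(2(1+a) log p)."""
    lam = math.sqrt(2.0 * (1.0 + a) * math.log(p))
    return p * math.erfc(lam / math.sqrt(2.0))


def stated_bound(p, a):
    """The closed-form bound (p^a sqrt(pi log p))^{-1} quoted in the text."""
    return 1.0 / (p ** a * math.sqrt(math.pi * math.log(p)))


for p in (10, 1000, 100000):
    print(p, union_tail_bound(p, 1.0), stated_bound(p, 1.0))
```

The comparison rests on the standard Gaussian tail inequality $\mathrm{pr}(Z > t) \le \phi(t)/t$, which yields the closed form after substituting $t = \lambda_p$.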

Having discussed the terms that are independent of the prior, we turn our
attention to the middle terms. The term (3.4) depends inversely upon
$\Pi(\mathcal{D}_{\nu,\kappa})$, the probability the prior assigns to a ball with a very
small radius around $\beta^0$. The behavior of this term illustrates the role that
sparsity plays in the behavior of the posterior. To see this, note that in order
to control this term, we must increase $a$. However, if
$\Pi(\mathcal{D}_{\nu,\kappa})$ is proportional to the volume of
$\mathcal{D}_{\nu,\kappa} \subset \mathbb{R}^p$,
then a must overcome and 



may be quite large. On the other hand, a sparsity-promoting prior can rescue
the asymptotic behavior of the posterior. Because a sparsity-promoting prior is
concentrated very near $S$-dimensional subspaces, the probability assigned to a
small ball around a sparse vector is proportional to the volume of a ball in
$\mathbb{R}^S$. Thus, $a$ can remain $O(S)$, and $\epsilon$ shrinks asymptotically if
$S^2\log p$ is $o(n)$.

Finally, we discuss the term (3.3). The presence of $\Pi(B\setminus\mathcal{P})$ in
the numerator indicates that this term can only be controlled if the prior
encourages sparsity. The presence of the term $\Pi(\mathcal{D}_{\nu,\kappa})$ in the
denominator then indicates that this term can only be controlled if the prior
encourages sparse $\beta$. In particular, if $\Pi$ is a compressible prior (see
[6, 9]), $\Pi(B\setminus\mathcal{P})$ should be small. In general, the decay of
$\Pi(B\setminus\mathcal{P})$ must overcome the growth of a $p^S$ term produced by
$\Pi(\mathcal{D}_{\nu,\kappa})$.

Based on Theorem 3.1, we may exhibit a general posterior contraction result
depending upon $\Pi(B\setminus\mathcal{P})$ and $\Pi(\mathcal{D}_{\nu,\kappa})$.

Theorem 3.2. Suppose $(S_n, n, p_n, X(n), \beta^0(n))$ is a sequence of problems
satisfying (A1) through (A4), and that $\Pi_n$ is a sequence of priors on
$\mathbb{R}^{p_n}$ such that




n(z>„) > P -<* 

ii. IL(B n \V)<p, 



■n 



where 




(3.9) 
Let $B_n = \{\beta \in \mathbb{R}^{p_n} : \|\beta - \beta^0(n)\|_{\ell_2} > 2\epsilon_n\}$ where

$$\epsilon_n = \frac{8\sigma}{1-\delta-\theta}\sqrt{\frac{(1+a_n)S_n\log p_n}{n}}. \tag{3.10}$$

Then

E^on n (B n |y) < (2+^+ 1 )^— l^+p-^n+i, (3 . n) 

Example 3.1. First, we turn our attention to a case that admits the simplest
(but still somewhat involved) analysis. We let $\{0,1\}^p_S$ denote the $p$-length
binary sequences with exactly $S$ nonzero entries and fix the model

$$\beta_i \sim \gamma_i\,\mathcal{N}(0, V^2) + (1-\gamma_i)\,\delta_0 \tag{3.12}$$
$$\gamma \sim \mathrm{Uniform}(\{0,1\}^p_S) \tag{3.13}$$

where $\mathrm{Uniform}(\{0,1\}^p_S)$ is the distribution with equal
($1/\binom{p}{S}$) probability for all $\gamma \in \{0,1\}^p_S$. First, note that
$\Pi(B\setminus\mathcal{P}_{S,0}) = 0$, which eliminates the term (3.3). Next, set
$u = 1$, $v = \infty$, and $\kappa = \sqrt{2(1+a)n\sigma^2\log p}$. This ensures that
$\mathrm{pr}_{\beta^0}(A_\kappa^c) \le 1/(p^a\sqrt{\pi\log p})$. For simplicity, we assume
that $\nu = 1$ and obtain the following reduction from Theorem 3.1:

$$\mathbb{E}_{\beta^0}\Pi(B\mid y) \le \left(2 + \frac{p}{\Pi(\mathcal{D}_{\nu,\kappa})}\right)\frac{1}{p^a\sqrt{\pi\log p}}. \tag{3.14}$$

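We now only need to estimate $\Pi(\mathcal{D}_{\nu,\kappa})$. To that end, suppose that
$\beta^0$ has support $T$ and denote by $\mathrm{Vol}(\mathcal{D}^S_{\nu,\kappa})$ the volume
of an $S$-dimensional $\ell_1$-ball with the same radius as $\mathcal{D}_{\nu,\kappa}$. Let
$M$ denote the minimum of $\prod_{i\in T}\mathcal{N}(\beta_i \mid 0, V^2)$ over
$\mathcal{D}^S_{\nu,\kappa}$. Then

$$\Pi(\mathcal{D}_{\nu,\kappa}) \ge M\,\mathrm{Vol}(\mathcal{D}^S_{\nu,\kappa})\,\Pi(\gamma = 1_T) = \binom{p}{S}^{-1} M\,\mathrm{Vol}(\mathcal{D}^S_{\nu,\kappa}) \tag{3.15}$$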



> 



( ^V\27ry 2 )- s / 2 e^^°^ 2 / 2V %- c ^/ 2y2 g-Hj (3-16) 

\SJ r(i + s) y ' 



J2V 2 -CIJ2V 2 ,„ r xS 

> g (^) (3-17) 

^(e^) 2 \ P J 

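Since $\|X^TX\|_{\ell_1\to\ell_\infty} = \max_{i,j}|(X^TX)_{i,j}| = n$ (see, e.g., [12]), we have

$$C_{\nu,\kappa} = \left(\frac{\sqrt{2n\sigma^2\log p}}{\sqrt{2n\sigma^2(1+a)\log p} + \sqrt{2n\sigma^2(1+a)\log p + 2n\sigma^2\log p}}\right)\sqrt{\frac{2\sigma^2\log p}{n}} \ge \sqrt{\frac{\sigma^2\log p}{2(2+a)n}}. \tag{3.18}$$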
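
The reduction in (3.18) can be sanity-checked numerically. The sketch below assumes that the radius $C_{\nu,\kappa}$ of Theorem 3.1 is the positive root of $\|X^TX\|t^2 + 2\kappa t = 2\sigma^2\nu\log p$, written in rationalized form, and compares it with the lower bound $\sqrt{\sigma^2\log p/(2(2+a)n)}$ for the choices $u = 1$, $v = \infty$, $\nu = 1$ made in this example. The function names and the numerical inputs are hypothetical.

```python
import math


def c_nu_kappa(kappa, gram_norm, sigma, nu, p):
    """Rationalized positive root of gram_norm*t^2 + 2*kappa*t = 2*sigma^2*nu*log(p)."""
    budget = 2.0 * gram_norm * sigma ** 2 * nu * math.log(p)
    return (math.sqrt(budget) / (kappa + math.sqrt(kappa ** 2 + budget))
            * math.sqrt(2.0 * sigma ** 2 * nu * math.log(p) / gram_norm))


def example_lower_bound(n, p, sigma, a):
    """Right-hand side of (3.18)."""
    return math.sqrt(sigma ** 2 * math.log(p) / (2.0 * (2.0 + a) * n))


n, p, sigma, a = 200, 5000, 1.5, 3.0
kappa = math.sqrt(2.0 * (1.0 + a) * n * sigma ** 2 * math.log(p))
radius = c_nu_kappa(kappa, gram_norm=n, sigma=sigma, nu=1.0, p=p)
print(radius, example_lower_bound(n, p, sigma, a))
```

The inequality follows from $\sqrt{1+a} + \sqrt{2+a} \le 2\sqrt{2+a}$ after cancelling the common factor $\sqrt{2n\sigma^2\log p}$.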
Combining (3.17) and (3.18), we have



I x s 



s l 1 lo &P 



p y (2 + a)n 




where we have set 770 = e C ".^/ 2V and rji = e c l 2V y ^fip^ ■ Note that 771 is 
constant, and (though it depends on n,p, and a) r/Q is approximately constant 
in the asymptotic regime. Combining (3.19) with (3.14), we have the bound 



E p «n(B\y)< I 2 + -g-f Z / ( 2 + Q ) n I I — . (3.20) 

Now, consider a sequence of problems $(S_n, n, p_n, X(n), \beta^0(n))$ satisfying
(A1) through (A4), and suppose we employ the Uniform-Gaussian prior with
parameter $V$ fixed for each $n$. Then the bound in (3.20) applies to
$\mathbb{E}_{\beta^0(n)}\Pi(B_n \mid y(n))$ for each $n$, where the radius of $B_n$ is
$2\epsilon_n$ with

$$\epsilon_n = \frac{8\sigma}{1-\delta-\theta}\sqrt{\frac{(1+a_n)S_n\log p_n}{n}}. \tag{3.21}$$



The most problematic contribution to the bound in (3.20) is p~", but we may 
adjust a n so that p^™ overcomes this term asymptotically. Thus, in order to 
obtain asymptotic consistency, we require $a_n - S_n \to \infty$ and
$(1+a_n)S_n\log p_n = o(n)$. This is possible if we set $a_n = S_n\log p_n$ and
$\log p_n = o(\sqrt{n})$.
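
To see the schedule $a_n = S_n\log p_n$ at work, the sketch below evaluates $\epsilon_n = \frac{8\sigma}{1-\delta-\theta}\sqrt{(1+a_n)S_n\log p_n/n}$ along a hypothetical sequence with $S_n = 2$ and $p_n = n^2$, so that $\log p_n = o(\sqrt n)$. The values of $\sigma$, $\delta$ and $\theta$ are placeholders chosen only for illustration.

```python
import math


def epsilon_n(n, p_n, s_n, sigma=1.0, delta=0.2, theta=0.2):
    """Radius epsilon_n with the choice a_n = S_n log p_n (placeholder constants)."""
    a_n = s_n * math.log(p_n)
    return 8.0 * sigma / (1.0 - delta - theta) * math.sqrt(
        (1.0 + a_n) * s_n * math.log(p_n) / n)


# Hypothetical growth regime: fixed sparsity, polynomially many predictors.
radii = [epsilon_n(n, p_n=n ** 2, s_n=2) for n in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6)]
print(radii)  # the radii shrink as n grows
```

The shrinkage is slow for moderate $n$ because $(1 + a_n)S_n\log p_n$ grows like $(S_n\log p_n)^2$, which is the price paid for scaling $a_n$.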

Example 3.2. Now, we assume the model

$$\beta_i \sim \gamma_i\,\mathcal{N}(0, V^2) + (1-\gamma_i)\,\delta_0 \tag{3.22}$$
$$\gamma_i \sim \mathrm{Bernoulli}(\phi) \tag{3.23}$$

where $\phi \in (0,1)$ controls the sparsity of the prior. As in the
Uniform-Gaussian prior, we shall set $u = 1$, $v = \infty$, $\nu = 1$, and
$\kappa = \sqrt{2(1+a)n\sigma^2\log p}$. In order to carry out a meaningful analysis in
this case, we must assume that $\beta^0$ is $K$-sparse and that $p\phi = K$. By
Chernoff-Hoeffding, we have that

Note that this is a bound for $\Pi(B\setminus\mathcal{P}_{S,0})$. We are left with
producing an estimate for $\Pi(\mathcal{D}_{\nu,\kappa})$:

$$\Pi(\mathcal{D}_{\nu,\kappa}) = \sum_{\gamma}\Pi(\mathcal{D}_{\nu,\kappa}\mid\gamma)\,\Pi(\gamma) \tag{3.25}$$



> 



7 

p-K 



E (" ^^(l-^-^^lT) (3-26) 



fe=0 



k=0 



* / v 7 ^^ r(l + X + fc) 
2</>C i ,, re \ K p sr ^ (p — K\ ( 2(f>C v , K \ k lA jsp-K-h \_ 



'/() 



770 UW(2T^ 1 "7 + 77V(2T^ ■ (3 - 27) 



Here, 770 is as in the previous example and 771 = e c / 21/ y J^-. In this case, the 
term (3.3) in Theorem 3.1 is bounded by 



L (1\ S+K ( 1. I 1 °zp ) ( p- K Y s d _ £ . m 1 ! / l °sp \ 

m\Sj (2 + a)n) \p-SJ \ y pp\j[2 + a)nj 

(3.28) 

Now, consider a sequence of problems $(K_n, n, p_n, X(n), \beta^0(n))$ satisfying
(A1) through (A4), and suppose we employ the Bernoulli-Gaussian prior with
parameters $V$ and $\phi_n = K_n/p_n$ for each $n$. In order to handle the term
(3.4), we need to choose $a_n$ so that $a_n - K_n \to \infty$. Thus, we set
$a_n = K_n\log p_n$. In order to deal with the term (3.3), we require
$S_n - K_n\log p_n \to \infty$. Finally, to shrink the radius of $B_n$, which is twice

$$\epsilon_n = \frac{8\sigma}{1-\delta-\theta}\sqrt{\frac{(1+a_n)S_n\log p_n}{n}}, \tag{3.29}$$

we may assume that $a_n = K_n\log p_n$, $S_n = K_n\log^2 p_n$, and thus we need
$K_n\log^2 p_n = o(\sqrt{n})$.

Example 3.3. Now, we examine the concentration of the posterior under a 
Laplace prior. First, we set 



A '* 



2 



e'^dx (3.30) 

-R 
and obtain the bound

n(B\V s ,B) < n(7% R ) (3.31) 

-J 
s 

On the other hand, we have 

11(2?) > e- x ^^u(B^ vH (0)j (3.33) 



< («)• («izpr . (3 , 2) 



-AII/?°IU, 7(P+1,AC,, K ) 

Ar( P + i) ■ (3 ' 35) 



From this, we arrive at the bound 



3 oTL{B\y) < 1 



oXmw XpT(p+l) \ 2 

7(p + 1, \c v , k ) J P a V^Wp 

Xpe^ a ^T(p + l) |>A 5 / p(l-0) \ P ~ S (3 37) 



In both of these terms, we must overcome the $\Gamma(p+1) = p!$ factor. While the
second term can be handled by scaling $\lambda$ appropriately, getting the first term
to decay requires $a_n \sim O(p_n)$, which means that $n$ must overcome
$p_n\log p_n$. This is clearly impossible if we are interested in the $n \ll p$
regime. Since these bounds are sharp, we must conclude that Theorem 3.1 is not
powerful enough to reveal any contraction behavior for the posterior under a
Laplace prior. This of course does not mean that the posterior under the
Laplace prior fails to concentrate, only that our theory cannot say anything
about its concentration.

4. Evidence supporting a sharper result 

Theorem 3.1 is the first result of its kind, but we are able to construct examples 
which behave provably better than what the Schwartz-type analysis can yield. 
In particular, with a little effort we may demonstrate the following concentration 
result for the Uniform-Gaussian prior. 

Theorem 4.1. Assume (A1) through (A4) and that $\Pi$ is the Uniform-Gaussian
prior with parameters S and V . Fix a > max |o, ~^q^26-32S } an ^ ^ 



(l-S)V 2 + (J 2 /n ' ^ 
where the positive constants $C_1$, $C_2$ and $C_3$ depend only on $\delta$ and $\theta$. If (n^-r +
yj)£ 2 > S/2, then there exists a constant n = T)(a ) S t O) > so that 



*l&i?)\v)> [ s/2 [ - ^ (4-2) 



with probability greater than $1 - 1/(p^a\sqrt{\pi\log p})$ on the draw of $y$.

Comparing this bound with that given in Example 3.1, it is clear that this 
theorem is much sharper. In particular, note that we no longer need to scale a to 
obtain asymptotic contraction. A very crude approximation in the asymptotic 

regime would be e rts J s '° s p and 



with probability exceeding $1 - 1/(p^a\sqrt{\pi\log p})$. In order to obtain
contraction, we simply let $S\log p = o(n)$. This is as good a result as one can
hope for, as (depending on $X$) $n$ must be at least $CS\log(p/S)$ to guarantee (A4).

Why doesn't Theorem 3.1 provide us as sharp a guarantee? While Schwartz- 
type arguments are perfect for asymptotic situations where the likelihood is 
concentrating very heavily on a small ball, in a finite undersampling regime the 
reduction of the denominator term to a small ball is too dramatic, and excludes 
a large portion of the integral. In particular, focusing on a small ball of the 
Uniform- Gaussian prior introduces the term (^J and when we approach the 
term 

$$\frac{\Pi(B\setminus\mathcal{P})}{\Pi(\mathcal{D}_{\nu,\kappa})}, \tag{4.4}$$

the shrinking radius of the ball $\mathcal{D}_{\nu,\kappa}$ forces us to require a
suboptimal sparsity level.

These observations and Theorem 4.1 suggest that a stronger result should
hold. In particular, the only pertinent restriction the prior should obey is that
$\Pi(\mathcal{P}^c)$ should be sufficiently small. This condition forces the posterior
to concentrate on the set of sparse $\beta$, and the likelihood automatically
concentrates on the sparse solutions to the linear inverse problem.



Appendix A 

The proof of our main result is a modification of the argument originally devised
by Schwartz [11]. In order to employ her strategy, we first find a large set of
$y$'s for which the numerator of $\Pi(\beta \mid y)$ admits a controllable upper bound,
and then we find another large set of $y$'s for which the denominator admits a
controllable lower bound.

In the literature, this former set is identified with a hypothesis test which
enjoys strong consistency behavior. As is often the case, we may base this
hypothesis test on a frequentist estimator. Our estimator of choice is the
Dantzig selector, and we employ Theorem 2.1 to exploit its theoretical
properties. The theoretical properties of the LASSO estimator [3? ] could
also be exploited to form such a hypothesis test.

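Proof of Theorem 2.1. Let $\tilde\beta = \sqrt{n}\beta^0$, set
$h = \hat{\tilde\beta} - \tilde\beta$, and suppose $T_0$ and $T_{01}$ follow the precedent
set in [5]. First, we note that

$$\|h_{T_0^c}\|_{\ell_1} \le \|h_{T_0}\|_{\ell_1} + 2\|\tilde\beta_{T_0^c}\|_{\ell_1}. \tag{4.5}$$

By Lemma 3.1 of [5], we then have

$$\begin{aligned}
\|h\|_{\ell_2} &\le \|h_{T_{01}}\|_{\ell_2} + S^{-1/2}\|h_{T_0^c}\|_{\ell_1} && (4.6)\\
&\le \|h_{T_{01}}\|_{\ell_2} + S^{-1/2}\left(\|h_{T_0}\|_{\ell_1} + 2\|\tilde\beta_{T_0^c}\|_{\ell_1}\right) && (4.7)\\
&\le \|h_{T_{01}}\|_{\ell_2} + \|h_{T_0}\|_{\ell_2} + 2S^{-1/2}\|\tilde\beta_{T_0^c}\|_{\ell_1} && (4.8)\\
&\le 2\|h_{T_{01}}\|_{\ell_2} + 2S^{-1/2}\|\tilde\beta_{T_0^c}\|_{\ell_1}. && (4.9)
\end{aligned}$$

Moreover, Lemma 3.1 also gives us

$$\begin{aligned}
\|h_{T_{01}}\|_{\ell_2} &\le \frac{1}{1-\delta}\|\tilde X_{T_{01}}^T\tilde Xh\|_{\ell_2} + \frac{\theta}{1-\delta}S^{-1/2}\|h_{T_0^c}\|_{\ell_1} && (4.10)\\
&\le \frac{2\sqrt{2}\,S^{1/2}\lambda_p\sigma}{1-\delta} + \frac{\theta}{1-\delta}S^{-1/2}\left(\|h_{T_0}\|_{\ell_1} + 2\|\tilde\beta_{T_0^c}\|_{\ell_1}\right) && (4.11)\\
&\le \frac{2\sqrt{2}\,S^{1/2}\lambda_p\sigma}{1-\delta} + \frac{\theta}{1-\delta}\|h_{T_{01}}\|_{\ell_2} + \frac{2\theta}{1-\delta}S^{-1/2}\|\tilde\beta_{T_0^c}\|_{\ell_1}. && (4.12)
\end{aligned}$$

Manipulation of this last inequality yields

$$\|h_{T_{01}}\|_{\ell_2} \le \frac{2\sqrt{2}\,S^{1/2}}{1-\delta-\theta}\lambda_p\sigma + \frac{2\theta}{1-\delta-\theta}S^{-1/2}\|\tilde\beta_{T_0^c}\|_{\ell_1}. \tag{4.13}$$

Combining bounds, we arrive at

$$\begin{aligned}
\|h\|_{\ell_2} &\le \frac{4\sqrt{2}}{1-\delta-\theta}S^{1/2}\lambda_p\sigma + 2\,\frac{1-\delta+\theta}{1-\delta-\theta}S^{-1/2}\|\tilde\beta_{T_0^c}\|_{\ell_1} && (4.14)\\
&= \frac{8\sigma}{1-\delta-\theta}\sqrt{(1+a)S\log p} + 2\,\frac{1-\delta+\theta}{1-\delta-\theta}S^{-1/2}\|\tilde\beta_{T_0^c}\|_{\ell_1} && (4.15)\\
&\le \frac{8\sigma}{1-\delta-\theta}\sqrt{(1+a)S\log p} + 2\,\frac{1-\delta+\theta}{1-\delta-\theta}S^{-1/2}\sqrt{n}R. && (4.16)
\end{aligned}$$

Dividing by $\sqrt{n}$ then yields the result. □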

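To simplify what follows, we set $\epsilon$ equal to the bound in Theorem 2.1 and
then define

$$\mathcal{P}^{\epsilon}_{S,R} = \{\beta \in \mathbb{R}^p : \|\beta - \beta^0\|_{\ell_2} > 2\epsilon\} \cap \mathcal{P}_{S,R}, \tag{4.17}$$

which we shall denote as $\mathcal{P}^{\epsilon}$ when there is no possibility for
ambiguity. We are now ready to define the set of $y$'s which produce
controllable denominators, and we also prove the properties we shall exploit.

Lemma 4.1. Define the critical region
$C = \{y \in \mathbb{R}^n : \|\hat\beta - \beta^0\|_{\ell_2} > \epsilon\}$ and the hypothesis test
$\Phi(y) = 1_C(y)$. Then,

1. $\mathbb{E}_{\beta^0}\Phi \le \dfrac{1}{p^a\sqrt{\pi\log p}}$

2. $\sup_{\beta\in\mathcal{P}^{\epsilon}}\mathbb{E}_\beta(1-\Phi) \le \dfrac{1}{p^a\sqrt{\pi\log p}}$

Proof. First, we compute the type I error rate for $\Phi$:

$$\mathbb{E}_{\beta^0}\Phi = \mathrm{pr}_{\beta^0}(C) \le \frac{1}{p^a\sqrt{\pi\log p}}. \tag{4.18}$$

In a similar fashion, we have

$$\sup_{\beta\in\mathcal{P}^{\epsilon}}\mathbb{E}_\beta(1-\Phi) = \sup_{\beta\in\mathcal{P}^{\epsilon}}\mathrm{pr}_\beta\{\|\hat\beta - \beta^0\|_{\ell_2} \le \epsilon\}. \tag{4.19}$$

The reverse triangle inequality then yields the bound

$$\begin{aligned}
\sup_{\beta\in\mathcal{P}^{\epsilon}}\mathbb{E}_\beta(1-\Phi) &\le \sup_{\beta\in\mathcal{P}^{\epsilon}}\mathrm{pr}_\beta\{\|\hat\beta - \beta\|_{\ell_2} \ge -\epsilon + \|\beta - \beta^0\|_{\ell_2}\} && (4.20)\\
&\le \sup_{\beta\in\mathcal{P}^{\epsilon}}\mathrm{pr}_\beta\{\|\hat\beta - \beta\|_{\ell_2} > \epsilon\}, && (4.21)
\end{aligned}$$

where we have used the fact that $\|\beta - \beta^0\|_{\ell_2} > 2\epsilon$ for all
$\beta \in \mathcal{P}^{\epsilon}$. Moreover, since $\beta$ is $S$-sparse for all
$\beta \in \mathcal{P}^{\epsilon}$,
$\mathrm{pr}_\beta\{\|\hat\beta - \beta\|_{\ell_2} > \epsilon\} \le 1/(p^a\sqrt{\pi\log p})$, and we
obtain the desired bound on the supremum. This completes the proof. □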

Because of the behavior of the Dantzig selector, this hypothesis test is only 
useful for distinguishing between sparse vectors. That is, a large set of non-sparse 
vectors may trigger a type II error. While this may be a damning indictment for 
its utility as a practical hypothesis test, we merely employ $\Phi$ in the
theoretical argument for our main theorem.

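Proof of Theorem 3.1. We apply the standard divide-and-conquer strategy
originally devised by Schwartz [11] to obtain

$$\begin{aligned}
\Pi(B\mid y) &= \Phi(y)\Pi(B\mid y) + (1-\Phi(y))\Pi(B\mid y)1_{A_\kappa}(y) && (4.22)\\
&\quad + (1-\Phi(y))\Pi(B\mid y)1_{A_\kappa^c}(y) && (4.23)\\
&\le \Phi(y) + (1-\Phi(y))\Pi(B\mid y)1_{A_\kappa}(y) + 1_{A_\kappa^c}(y). && (4.24)
\end{aligned}$$

By Lemma 4.1, we have that $\mathbb{E}_{\beta^0}\Phi \le 1/(p^a\sqrt{\pi\log p})$, so this term
is immediately eliminated. Additionally, we have that
$\mathbb{E}_{\beta^0}1_{A_\kappa^c}(y) = \mathrm{pr}_{\beta^0}(A_\kappa^c)$. Having dispensed with the
first and third terms, we proceed to attack the middle term.

We first multiply this remaining term by a form of 1 to obtain

$$(1-\Phi(y))\Pi(B\mid y)1_{A_\kappa}(y) = (1-\Phi(y))\,1_{A_\kappa}(y)\,\frac{\int_B \frac{f(y\mid\beta)}{f(y\mid\beta^0)}\,d\Pi(\beta)}{\int \frac{f(y\mid\beta)}{f(y\mid\beta^0)}\,d\Pi(\beta)}. \tag{4.25}$$

Now, we bound the denominator by the expression

$$\int \frac{f(y\mid\beta)}{f(y\mid\beta^0)}\,d\Pi(\beta) \ge \exp\{-\nu\log p\}\,\Pi(\mathcal{D}_\nu(y)), \tag{4.26}$$

where

$$\begin{aligned}
\mathcal{D}_\nu(y) &= \left\{\beta \in \mathbb{R}^p : \frac{1}{\log p}\log\frac{f(y\mid\beta^0)}{f(y\mid\beta)} < \nu\right\} && (4.27)\\
&= \left\{\beta \in \mathbb{R}^p : \frac{1}{\log p}\left(\|y - X\beta\|_{\ell_2}^2 - \|y - X\beta^0\|_{\ell_2}^2\right) < 2\sigma^2\nu\right\} && (4.28)\\
&= \left\{\beta \in \mathbb{R}^p : \|y - X\beta\|_{\ell_2}^2 - \|y - X\beta^0\|_{\ell_2}^2 < 2\sigma^2\nu\log p\right\}. && (4.29)
\end{aligned}$$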

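By applying the Hölder inequality and the definition of the operator norm, it is
easy to see that the left-hand side of the inequality in (4.29) is

$$\begin{aligned}
&= \langle (y - X\beta) + (y - X\beta^0),\ (y - X\beta) - (y - X\beta^0)\rangle && (4.30)\\
&= \langle 2y - 2X\beta^0,\ X(\beta^0 - \beta)\rangle + \langle X(\beta^0 - \beta),\ X(\beta^0 - \beta)\rangle && (4.31)\\
&\le 2\|X^T(y - X\beta^0)\|_{\ell_v}\|\beta - \beta^0\|_{\ell_u} + \|X^TX\|_{\ell_u\to\ell_v}\|\beta - \beta^0\|_{\ell_u}^2 && (4.32)\\
&\le 2\kappa\|\beta - \beta^0\|_{\ell_u} + \|X^TX\|_{\ell_u\to\ell_v}\|\beta - \beta^0\|_{\ell_u}^2 && (4.33)
\end{aligned}$$

since $\kappa \ge \|X^T(y - X\beta^0)\|_{\ell_v}$ for $y \in A_\kappa$. Following (4.29), we force
$\|\beta - \beta^0\|_{\ell_u}$ to satisfy the inequality

$$\|X^TX\|_{\ell_u\to\ell_v}\|\beta - \beta^0\|_{\ell_u}^2 + 2\kappa\|\beta - \beta^0\|_{\ell_u} < 2\sigma^2\nu\log p, \tag{4.34}$$

which then yields the bound on $\|\beta - \beta^0\|_{\ell_u}$

$$\begin{aligned}
\|\beta - \beta^0\|_{\ell_u} &< \frac{\sqrt{4\kappa^2 + 8\|X^TX\|_{\ell_u\to\ell_v}\sigma^2\nu\log p} - 2\kappa}{2\|X^TX\|_{\ell_u\to\ell_v}} && (4.35)\\
&= \left(\frac{\sqrt{8\|X^TX\|_{\ell_u\to\ell_v}\sigma^2\nu\log p}}{2\kappa + \sqrt{4\kappa^2 + 8\|X^TX\|_{\ell_u\to\ell_v}\sigma^2\nu\log p}}\right)\frac{\sqrt{8\|X^TX\|_{\ell_u\to\ell_v}\sigma^2\nu\log p}}{2\|X^TX\|_{\ell_u\to\ell_v}} && (4.36)\\
&= \left(\frac{\sqrt{2\|X^TX\|_{\ell_u\to\ell_v}\sigma^2\nu\log p}}{\kappa + \sqrt{\kappa^2 + 2\|X^TX\|_{\ell_u\to\ell_v}\sigma^2\nu\log p}}\right)\sqrt{\frac{2\sigma^2\nu\log p}{\|X^TX\|_{\ell_u\to\ell_v}}}. && (4.37)
\end{aligned}$$

Based on this sequence of inequalities, we conclude that
$\mathcal{D}_{\nu,\kappa} \subset \mathcal{D}_\nu(y)$ when $y \in A_\kappa$. Putting this all together, we
have that $\Pi(\mathcal{D}_\nu(y)) \ge \Pi(\mathcal{D}_{\nu,\kappa})$ for $y \in A_\kappa$, and hence

$$\int \frac{f(y\mid\beta)}{f(y\mid\beta^0)}\,d\Pi(\beta) \ge p^{-\nu}\,\Pi(\mathcal{D}_{\nu,\kappa}) \tag{4.38}$$

for all $y \in A_\kappa$. Applying this, we obtain the bound

$$(1-\Phi(y))\Pi(B\mid y)1_{A_\kappa}(y) \le \frac{(1-\Phi(y))\int_B \frac{f(y\mid\beta)}{f(y\mid\beta^0)}\,d\Pi(\beta)}{p^{-\nu}\,\Pi(\mathcal{D}_{\nu,\kappa})}. \tag{4.39}$$

Taking the expectation of the numerator and applying Tonelli yields

$$\begin{aligned}
\mathbb{E}_{\beta^0}\left[(1-\Phi(y))\int_B \frac{f(y\mid\beta)}{f(y\mid\beta^0)}\,d\Pi(\beta)\right] &= \int_B \mathbb{E}_{\beta^0}\left[(1-\Phi(y))\frac{f(y\mid\beta)}{f(y\mid\beta^0)}\right]d\Pi(\beta) && (4.40)\\
&= \int_B \mathbb{E}_\beta(1-\Phi(y))\,d\Pi(\beta). && (4.41)
\end{aligned}$$

We now split this and bound using Lemma 4.1:

$$\begin{aligned}
\int_B \mathbb{E}_\beta(1-\Phi(y))\,d\Pi(\beta) &= \int_{B\setminus\mathcal{P}} \mathbb{E}_\beta(1-\Phi(y))\,d\Pi(\beta) && (4.42)\\
&\quad + \int_{B\cap\mathcal{P}} \mathbb{E}_\beta(1-\Phi(y))\,d\Pi(\beta) && (4.43)\\
&\le \Pi(B\setminus\mathcal{P}) + \Pi(\mathcal{P})\frac{1}{p^a\sqrt{\pi\log p}} && (4.44)\\
&\le \Pi(B\setminus\mathcal{P}) + \frac{1}{p^a\sqrt{\pi\log p}}. && (4.45)
\end{aligned}$$

This establishes the result. □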



Appendix B 

In order to prove Theorem 4.1, we shall require some additional notation. For a
fixed $\sigma$, $V$, $X$, $y$, $\epsilon$, and $\gamma \in \{0,1\}^p$, we let $X_\gamma$ denote
the matrix obtained by deleting the columns of $X$ with indices $i$ such that
$\gamma_i = 0$, $P_\gamma$ denote the orthogonal projection onto the span of the
columns of $X_\gamma$,

$$\Sigma_\gamma = \left(\frac{1}{\sigma^2}X_\gamma^TX_\gamma + \frac{1}{V^2}I_{S\times S}\right)^{-1} \tag{4.46}$$

and

$$\mu_\gamma = \frac{1}{\sigma^2}\Sigma_\gamma X_\gamma^T y. \tag{4.47}$$

Additionally, we shall slightly abuse notation by letting $\beta_\gamma$ denote the
projection of $\beta$ onto the coordinates indicated by $\gamma$ and the Hadamard
product of $\beta$ with $\gamma$ depending upon the context. Finally, for
$\gamma, \gamma' \in \{0,1\}^p$, we shall write $\gamma \le \gamma'$ to indicate that
$\gamma'$ dominates $\gamma$ entrywise.

We first begin with a simple probabilistic noise bound in the spirit of Candes
and Tao [5]. The proof is a simple application of the Markov inequality.

Lemma 4.2. Assume that $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ for $i = 1,\dots,n$ and
$\|X_i\|_{\ell_2}^2 = 1$ for $i = 1,\dots,p$. If

$$\mathcal{E} = \{\epsilon \in \mathbb{R}^n : \|X_\gamma^T\epsilon\|_{\ell_2}^2 \le 4\sigma^2(1+a)|\gamma|\log p,\ \forall\gamma \in \{0,1\}^p\}, \tag{4.48}$$

then

$$\mathrm{pr}_\epsilon(\mathcal{E}) \ge 1 - \frac{1}{p^a\sqrt{\pi\log p}}. \tag{4.49}$$


The next lemma we shall require deterministically bounds the difference be- 
tween similar operators. 

Lemma 4.3. Assume (Al)-(A^), if j G {0, 1} P satisfies 7 < 7 tmd \-y\ < 2S, 
we have that 



P 7 - X 1 {X^X 1 + ^2-f| 7 |x| 7 |) ±X y 



< 



n(l-5)V 2 + a 2 



(4.50) 



J l7|x|7l ~ ( X J X l + y2 J l7|x| 7 |) X 1 X 



< 



.„ ~ n{l-5)V 2 + a 2 



(4.51) 



We shall also require bounds on the determinants of the restricted opera- 
tors. This lemma and the preceding lemma follow from the RIP hypothesis and 
application of an SVD. 

Lemma 4.4. Assuming (Al) and (A4), if 7 G {0, 1} P satisfies | 7 | < 25", then 
we have that 



n(l + 8) 1 
~~ ^ + V 2 



- H/2 J , (nil- 5) 1 

< det(E 7 ) < ' 



V 2 



-ItI/2 



(4.52) 



On the other hand, if | 7 | > 2S , then 

'n (l-S) 1 

CJ 2 V^ 2 



det(E 7 ) < 



(4.53) 



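Now, we exhibit a bound on the difference between norms of different reconstructions.

Lemma 4.5. Assume (A1), (A2), and (A4). If $\gamma, \gamma' \in \{0,1\}^p$ satisfy $\gamma^0 \leq \gamma$ and $|\gamma'| \leq 2S$, then

\[
(X\beta^0)^T (P_{\gamma'} - P_\gamma) X\beta^0 \leq -n \left( 1 - \delta - \frac{\theta^2}{1-\delta} \right) \|\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2}^2. \tag{4.54}
\]

Moreover, $\delta + \frac{\theta^2}{1-\delta} < 1$.

Proof. Since $\gamma^0 \leq \gamma$, we have that $P_\gamma X\beta^0 = X_{\gamma^0}\beta^0_{\gamma^0}$, and therefore

\begin{align}
(X\beta^0)^T (P_{\gamma'} - P_\gamma) X\beta^0 &= (X_{\gamma^0}\beta^0_{\gamma^0})^T (P_{\gamma'} - I_{n \times n}) X_{\gamma^0}\beta^0_{\gamma^0} \tag{4.55} \\
&= (X_{\gamma^0 \setminus \gamma'}\beta^0_{\gamma^0 \setminus \gamma'})^T (P_{\gamma'} - I_{n \times n}) X_{\gamma^0 \setminus \gamma'}\beta^0_{\gamma^0 \setminus \gamma'} \notag \\
&= -\|X_{\gamma^0 \setminus \gamma'}\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2}^2 + \|P_{\gamma'} X_{\gamma^0 \setminus \gamma'}\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2}^2. \notag
\end{align}

We then have

\begin{align}
\|P_{\gamma'} X_{\gamma^0 \setminus \gamma'}\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2}^2 &\leq \frac{1}{n(1-\delta)} \|X_{\gamma'}^T X_{\gamma^0 \setminus \gamma'}\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2}^2 \tag{4.56} \\
&\leq \frac{n\theta^2}{1-\delta} \|\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2}^2 \tag{4.57}
\end{align}

since

\begin{align}
\|X_{\gamma'}^T X_{\gamma^0 \setminus \gamma'}\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2}^2 &= \left\langle X_{\gamma'} X_{\gamma'}^T X_{\gamma^0 \setminus \gamma'}\beta^0_{\gamma^0 \setminus \gamma'},\ X_{\gamma^0 \setminus \gamma'}\beta^0_{\gamma^0 \setminus \gamma'} \right\rangle \tag{4.58} \\
&\leq n\theta \, \|X_{\gamma'}^T X_{\gamma^0 \setminus \gamma'}\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2} \|\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2} \tag{4.59}
\end{align}

implies $\|X_{\gamma'}^T X_{\gamma^0 \setminus \gamma'}\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2}^2 \leq n^2\theta^2 \|\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2}^2$. Combining this with the fact that $n(1-\delta)\|\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2}^2 \leq \|X_{\gamma^0 \setminus \gamma'}\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2}^2$, we obtain the bound

\[
(X\beta^0)^T (P_{\gamma'} - P_\gamma) X\beta^0 \leq -n \left( 1 - \delta - \frac{\theta^2}{1-\delta} \right) \|\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2}^2. \tag{4.60}
\]

Finally, note that

\[
\theta + \delta < 1 \implies \theta < (1-\delta) \implies \frac{\theta^2}{1-\delta} < 1 - \delta \implies \delta + \frac{\theta^2}{1-\delta} < 1. \tag{4.61}
\]

□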

This next lemma bounds the differences of inner products of reconstructions 
with the noise vector. 

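Lemma 4.6. Assume (A1) through (A4), and $\epsilon \in \mathcal{E}$ from (4.48). If $\gamma^0 \leq \gamma$ and $|\gamma'| \leq 2S$, then

\[
|(X\beta^0)^T (P_{\gamma'} - P_\gamma)\epsilon| \leq 2\sigma \, \frac{1-\delta+\theta}{1-\delta} \sqrt{(1+\alpha) \max\{|\gamma^0 \setminus \gamma'|, |\gamma'|\} \, n \log p} \ \|\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2}.
\]

Proof. We compute

\begin{align}
|(X\beta^0)^T (P_{\gamma'} - P_\gamma)\epsilon| &= |(X_{\gamma^0}\beta^0_{\gamma^0})^T (P_{\gamma'} - I_{n \times n})\epsilon| \tag{4.62} \\
&= |(X_{\gamma^0 \setminus \gamma'}\beta^0_{\gamma^0 \setminus \gamma'})^T (P_{\gamma'} - I_{n \times n})\epsilon| \tag{4.63} \\
&= |-(\beta^0_{\gamma^0 \setminus \gamma'})^T X_{\gamma^0 \setminus \gamma'}^T \epsilon + (X_{\gamma^0 \setminus \gamma'}\beta^0_{\gamma^0 \setminus \gamma'})^T P_{\gamma'}\epsilon| \tag{4.64} \\
&\leq \|X_{\gamma^0 \setminus \gamma'}^T \epsilon\|_{\ell_2} \|\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2} + \|X_{\gamma^0 \setminus \gamma'}^T P_{\gamma'}\epsilon\|_{\ell_2} \|\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2} \notag \\
&\leq \|X_{\gamma^0 \setminus \gamma'}^T \epsilon\|_{\ell_2} \|\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2} + \frac{\theta}{1-\delta} \|X_{\gamma'}^T \epsilon\|_{\ell_2} \|\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2} \notag \\
&\leq 2\sigma \left( 1 + \frac{\theta}{1-\delta} \right) \sqrt{(1+\alpha) \, n \max\{|\gamma^0 \setminus \gamma'|, |\gamma'|\} \log p} \ \|\beta^0_{\gamma^0 \setminus \gamma'}\|_{\ell_2}. \notag
\end{align}

□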

Our last lemma is used to clean up a calculation that arises. 
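Lemma 4.7. Let

\[
\tau \geq \frac{8\sqrt{2}\,\sigma}{1-\delta-\theta} \sqrt{\frac{(1+\alpha) S \log p}{n}}. \tag{4.65}
\]

Then

\[
-n \left( 1 - \delta - \frac{\theta^2}{1-\delta} \right) \tau^2 + 4\sqrt{2}\,\sigma \, \frac{1-\delta+\theta}{1-\delta} \sqrt{(1+\alpha) S n \log p} \ \tau \leq -\frac{n}{2} \left( 1 - \delta - \frac{\theta^2}{1-\delta} \right) \tau^2.
\]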
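The inequality of Lemma 4.7 can be checked numerically. The sketch below assumes that the lemma asserts $-nA\tau^2 + 4\sqrt{2}\,\sigma c \sqrt{(1+\alpha) S n \log p}\,\tau \leq -\frac{n}{2} A \tau^2$ with $A = 1 - \delta - \theta^2/(1-\delta)$ and $c = (1-\delta+\theta)/(1-\delta)$, and that the threshold on $\tau$ is $\frac{8\sqrt{2}\sigma}{1-\delta-\theta}\sqrt{(1+\alpha) S \log p / n}$. Under that reading the inequality holds with equality at the threshold, because $c/A = 1/(1-\delta-\theta)$. The function names are hypothetical.

```python
import math


def tau_threshold(n, S, p, sigma, alpha, delta, theta):
    """Assumed threshold: 8 sqrt(2) sigma / (1 - delta - theta) * sqrt((1 + alpha) S log p / n)."""
    return (8.0 * math.sqrt(2.0) * sigma / (1.0 - delta - theta)
            * math.sqrt((1.0 + alpha) * S * math.log(p) / n))


def lemma_4_7_gap(tau, n, S, p, sigma, alpha, delta, theta):
    """Right-hand side minus left-hand side of the inequality; nonnegative when it holds."""
    A = 1.0 - delta - theta ** 2 / (1.0 - delta)
    c = (1.0 - delta + theta) / (1.0 - delta)
    cross = 4.0 * math.sqrt(2.0) * sigma * c * math.sqrt((1.0 + alpha) * S * n * math.log(p))
    lhs = -n * A * tau ** 2 + cross * tau
    rhs = -0.5 * n * A * tau ** 2
    return rhs - lhs
```

Evaluating `lemma_4_7_gap` at the threshold, at twice the threshold, and at half the threshold shows the gap changing sign exactly at `tau_threshold`.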

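Proof of Theorem 4.1. First, we exhibit an explicit formula for the posterior. Given any measurable set $U \subset \mathbb{R}^n$, we have that

\begin{align}
\int_U f(y \mid \beta) \, d\Pi(\beta) &= \binom{p}{S}^{-1} \sum_{\gamma \in \{0,1\}^p_S} \int_U (2\pi\sigma^2)^{-n/2} e^{-\|y - X\beta_\gamma\|_{\ell_2}^2 / 2\sigma^2} (2\pi V^2)^{-S/2} e^{-\|\beta_\gamma\|_{\ell_2}^2 / 2V^2} \, d\beta_\gamma \tag{4.66} \\
&= \binom{p}{S}^{-1} (2\pi\sigma^2)^{-n/2} (2\pi V^2)^{-S/2} \sum_{\gamma \in \{0,1\}^p_S} \int_U e^{-\frac{\|y - X\beta_\gamma\|_{\ell_2}^2}{2\sigma^2} - \frac{\|\beta_\gamma\|_{\ell_2}^2}{2V^2}} \, d\beta_\gamma. \tag{4.67}
\end{align}

Completing the square gives us

\[
\frac{1}{2\sigma^2} \|y - X\beta_\gamma\|_{\ell_2}^2 + \frac{1}{2V^2} \|\beta_\gamma\|_{\ell_2}^2 = \frac{1}{2} (\beta_\gamma - \mu_\gamma)^T \Sigma_\gamma^{-2} (\beta_\gamma - \mu_\gamma) + \frac{1}{2\sigma^2} \|y\|_{\ell_2}^2 - \frac{1}{2} \mu_\gamma^T \Sigma_\gamma^{-2} \mu_\gamma.
\]

Therefore, if $U = \mathbb{R}^n$, we have that

\[
\int e^{-\frac{\|y - X\beta_\gamma\|_{\ell_2}^2}{2\sigma^2} - \frac{\|\beta_\gamma\|_{\ell_2}^2}{2V^2}} \, d\beta_\gamma = (2\pi)^{S/2} \det(\Sigma_\gamma) \, e^{-\frac{1}{2\sigma^2} \|y\|_{\ell_2}^2 + \frac{1}{2} \mu_\gamma^T \Sigma_\gamma^{-2} \mu_\gamma} \tag{4.68}
\]

and hence

\[
\int_{\mathbb{R}^n} f(y \mid \beta) \, d\Pi(\beta) = e^{-\frac{1}{2\sigma^2} \|y\|_{\ell_2}^2} \binom{p}{S}^{-1} (2\pi\sigma^2)^{-n/2} V^{-S} \sum_{\gamma \in \{0,1\}^p_S} \det(\Sigma_\gamma) \, e^{\frac{1}{2} \mu_\gamma^T \Sigma_\gamma^{-2} \mu_\gamma}.
\]

On the other hand,

\begin{align}
\int_{B_{2\varepsilon}(\beta^0)} f(y \mid \beta) \, d\Pi(\beta) &= \binom{p}{S}^{-1} (2\pi\sigma^2)^{-n/2} (2\pi V^2)^{-S/2} \sum_{\gamma \in \{0,1\}^p_S} \int_{B_{2\varepsilon}(\beta^0)} e^{-\frac{\|y - X\beta_\gamma\|_{\ell_2}^2}{2\sigma^2} - \frac{\|\beta_\gamma\|_{\ell_2}^2}{2V^2}} \, d\beta_\gamma \tag{4.69} \\
&= e^{-\frac{1}{2\sigma^2} \|y\|_{\ell_2}^2} \binom{p}{S}^{-1} (2\pi\sigma^2)^{-n/2} V^{-S} \sum_{\gamma \in \{0,1\}^p_S} \det(\Sigma_\gamma) \, e^{\frac{1}{2} \mu_\gamma^T \Sigma_\gamma^{-2} \mu_\gamma} \int_{B_{2\varepsilon}(\beta^0)} \frac{e^{-\frac{1}{2} (\beta_\gamma - \mu_\gamma)^T \Sigma_\gamma^{-2} (\beta_\gamma - \mu_\gamma)}}{(2\pi)^{S/2} \det(\Sigma_\gamma)} \, d\beta_\gamma. \notag
\end{align}

Putting this all together, we have

\begin{align}
\Pi(B_{2\varepsilon}(\beta^0) \mid y) &= \frac{\int_{B_{2\varepsilon}(\beta^0)} f(y \mid \beta) \, d\Pi(\beta)}{\int_{\mathbb{R}^n} f(y \mid \beta) \, d\Pi(\beta)} \tag{4.70} \\
&= \frac{\sum_{\gamma \in \{0,1\}^p_S} \det(\Sigma_\gamma) \, e^{\frac{1}{2} \mu_\gamma^T \Sigma_\gamma^{-2} \mu_\gamma} \int_{B_{2\varepsilon}(\beta^0)} \frac{e^{-\frac{1}{2} (\beta_\gamma - \mu_\gamma)^T \Sigma_\gamma^{-2} (\beta_\gamma - \mu_\gamma)}}{(2\pi)^{S/2} \det(\Sigma_\gamma)} \, d\beta_\gamma}{\sum_{\gamma \in \{0,1\}^p_S} \det(\Sigma_\gamma) \, e^{\frac{1}{2} \mu_\gamma^T \Sigma_\gamma^{-2} \mu_\gamma}}. \notag
\end{align}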
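The completing-the-square step and the Gaussian integral (4.68) can be sanity-checked in the simplest case of a single selected column ($S = 1$). The sketch below is only an illustration under that assumption: `x` plays the role of $X_\gamma$, the scalar `beta` plays the role of $\beta_\gamma$, and all function names are hypothetical. It compares the original exponent with its completed-square form and compares a midpoint-rule approximation of the integral with the closed form $(2\pi)^{1/2} \det(\Sigma_\gamma) \exp(-\|y\|^2/2\sigma^2 + \mu_\gamma^2/2\Sigma_\gamma^2)$.

```python
import math


def posterior_params(x, y, sigma, V):
    """Sigma_gamma and mu_gamma from (4.46)-(4.47) for one selected column x (S = 1)."""
    xtx = sum(xi * xi for xi in x)
    xty = sum(xi * yi for xi, yi in zip(x, y))
    Sigma = (xtx / sigma ** 2 + 1.0 / V ** 2) ** -0.5
    mu = Sigma ** 2 * xty / sigma ** 2
    return Sigma, mu


def exponent(beta, x, y, sigma, V):
    """||y - x beta||^2 / (2 sigma^2) + beta^2 / (2 V^2)."""
    rss = sum((yi - xi * beta) ** 2 for xi, yi in zip(x, y))
    return rss / (2.0 * sigma ** 2) + beta ** 2 / (2.0 * V ** 2)


def completed_square(beta, x, y, sigma, V):
    """The same exponent after completing the square in beta."""
    Sigma, mu = posterior_params(x, y, sigma, V)
    yy = sum(yi * yi for yi in y)
    return ((beta - mu) ** 2 / (2.0 * Sigma ** 2)
            + yy / (2.0 * sigma ** 2) - mu ** 2 / (2.0 * Sigma ** 2))


def integral_numeric(x, y, sigma, V, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of the integral of exp(-exponent) over beta."""
    Sigma, mu = posterior_params(x, y, sigma, V)
    a = mu - half_width * Sigma
    h = 2.0 * half_width * Sigma / steps
    return h * sum(math.exp(-exponent(a + (k + 0.5) * h, x, y, sigma, V))
                   for k in range(steps))


def integral_closed_form(x, y, sigma, V):
    """Right-hand side of (4.68) with S = 1."""
    Sigma, mu = posterior_params(x, y, sigma, V)
    yy = sum(yi * yi for yi in y)
    return (math.sqrt(2.0 * math.pi) * Sigma
            * math.exp(-yy / (2.0 * sigma ** 2) + mu ** 2 / (2.0 * Sigma ** 2)))
```

The integration window of twelve posterior standard deviations around $\mu_\gamma$ is an arbitrary choice that makes the truncated tail negligible.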

Now that we have an explicit expression for the posterior, we bound this ex- 
pression below by reducing the sum in the numerator to the indices in the set 



G = | 7 E {0, 1}> : ||/V\ 7 lk < V (1 + T lQSP } ^ 71 ) 

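That is, we restrict to the indices that capture most of the mass of $\beta^0$. If $\gamma \in G$, then

\begin{align}
\|\beta^0_\gamma - \mu_\gamma\|_{\ell_2} &= \left\| \beta^0_\gamma - \left( X_\gamma^T X_\gamma + \frac{\sigma^2}{V^2} I_{S \times S} \right)^{-1} X_\gamma^T (X\beta^0 + \epsilon) \right\|_{\ell_2} \tag{4.72} \\
&\leq \left\| \left( I_{S \times S} - \left( X_\gamma^T X_\gamma + \frac{\sigma^2}{V^2} I_{S \times S} \right)^{-1} X_\gamma^T X_\gamma \right) \beta^0_\gamma \right\|_{\ell_2} \tag{4.73} \\
&\quad + \left\| \left( X_\gamma^T X_\gamma + \frac{\sigma^2}{V^2} I_{S \times S} \right)^{-1} X_\gamma^T X_{\gamma^0 \setminus \gamma} \beta^0_{\gamma^0 \setminus \gamma} \right\|_{\ell_2} \tag{4.74} \\
&\quad + \left\| \left( X_\gamma^T X_\gamma + \frac{\sigma^2}{V^2} I_{S \times S} \right)^{-1} X_\gamma^T \epsilon \right\|_{\ell_2} \tag{4.75} \\
&\leq \frac{\sigma^2}{n(1-\delta)V^2 + \sigma^2} \|\beta^0_\gamma\|_{\ell_2} \tag{4.76} \\
&\quad + \frac{n\theta}{n(1-\delta) + \sigma^2/V^2} \|\beta^0_{\gamma^0 \setminus \gamma}\|_{\ell_2} \tag{4.77} \\
&\quad + \frac{2\sqrt{nS(1+\alpha)\sigma^2 \log p}}{n(1-\delta) + \sigma^2/V^2} \tag{4.78} \\
&= \frac{\sigma^2 \|\beta^0_\gamma\|_{\ell_2} + n\theta V^2 \|\beta^0_{\gamma^0 \setminus \gamma}\|_{\ell_2} + 2\sigma V^2 \sqrt{(1+\alpha) S n \log p}}{n(1-\delta)V^2 + \sigma^2}. \notag
\end{align}

Given this bound, we may conclude that (using the bound $\|\beta^0_\gamma\|_{\ell_2} \leq C\sqrt{S}$, but note that a large enough $V$ may be chosen to dampen this contribution)

\begin{align}
\|\beta^0 - \mu_\gamma\|_{\ell_2} &\leq \|\beta^0_{\gamma^0 \setminus \gamma}\|_{\ell_2} + \|\beta^0_\gamma - \mu_\gamma\|_{\ell_2} \tag{4.79} \\
&\leq \frac{\sigma^2 \|\beta^0_\gamma\|_{\ell_2} + \left( n(1-\delta+\theta)V^2 + \sigma^2 \right) \|\beta^0_{\gamma^0 \setminus \gamma}\|_{\ell_2} + 2\sigma V^2 \sqrt{(1+\alpha) S n \log p}}{n(1-\delta)V^2 + \sigma^2} \notag \\
&< \varepsilon \tag{4.80}
\end{align}

and hence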

f e -i(/J T -M T ) T S; 2 (/3 7 -M T ) r e -^(/3 T -^) T S; 2 (/3 T -M T ) 

J Bt ,.,. ^T"^ d ^ - L„. . ^s7^ d/3 ^ 



'B2e(/3°) V27T det(E 7 ) -/b^(^) det(E 7 ) 



/• e -|(«Tr+^)ll^-A'Tll? 2 

> / ^dl3 1 

> i_ e -i(«^+^)- 2 . (4.81) 
1 



This last expression holds because of the hypothesis {n^S- + -^i)e 2 > S/2. At 



this stage, we have constructed the bound 

1 ,, T v- 2 



V\ pr det(E 7 )e3'S s -r ^ i i * 

n(^(/3°)| 2/ ) > ^ eG \ 7 \ t T . (1 - e "*^+^ ). (4.82) 



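We now approach the expression

\begin{align}
\frac{\sum_{\gamma \in G} \det(\Sigma_\gamma) \, e^{\frac{1}{2} \mu_\gamma^T \Sigma_\gamma^{-2} \mu_\gamma}}{\sum_{\gamma \in \{0,1\}^p_S} \det(\Sigma_\gamma) \, e^{\frac{1}{2} \mu_\gamma^T \Sigma_\gamma^{-2} \mu_\gamma}} &= \cfrac{1}{1 + \cfrac{\sum_{\gamma' \in \{0,1\}^p_S \setminus G} \det(\Sigma_{\gamma'}) \, e^{\frac{1}{2} \mu_{\gamma'}^T \Sigma_{\gamma'}^{-2} \mu_{\gamma'}}}{\sum_{\gamma \in G} \det(\Sigma_\gamma) \, e^{\frac{1}{2} \mu_\gamma^T \Sigma_\gamma^{-2} \mu_\gamma}}} \tag{4.83} \\
&= \cfrac{1}{1 + \sum_{\gamma' \in \{0,1\}^p_S \setminus G} \cfrac{1}{\sum_{\gamma \in G} \frac{\det(\Sigma_\gamma)}{\det(\Sigma_{\gamma'})} \, e^{\frac{1}{2} \left( \mu_\gamma^T \Sigma_\gamma^{-2} \mu_\gamma - \mu_{\gamma'}^T \Sigma_{\gamma'}^{-2} \mu_{\gamma'} \right)}}} \notag \\
&\geq \cfrac{1}{1 + \sum_{\gamma' \in \{0,1\}^p_S \setminus G} \cfrac{1}{\sum_{\gamma^0 \leq \gamma} \frac{\det(\Sigma_\gamma)}{\det(\Sigma_{\gamma'})} \, e^{\frac{1}{2} \left( \mu_\gamma^T \Sigma_\gamma^{-2} \mu_\gamma - \mu_{\gamma'}^T \Sigma_{\gamma'}^{-2} \mu_{\gamma'} \right)}}}. \notag
\end{align}

In the last step, we have reduced the index set over the sum inside the continued fraction from $G$ to its subset $\{\gamma \in \{0,1\}^p_S : \gamma^0 \leq \gamma\}$. Based on this initial bound, we shall seek upper bounds on the expressions

\[
\frac{\det(\Sigma_{\gamma'})}{\det(\Sigma_\gamma)} \, e^{\frac{1}{2} \left( \mu_{\gamma'}^T \Sigma_{\gamma'}^{-2} \mu_{\gamma'} - \mu_\gamma^T \Sigma_\gamma^{-2} \mu_\gamma \right)}
\]

for all $\gamma^0 \leq \gamma$ and $\gamma' \in \{0,1\}^p_S \setminus G$. First, we note that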

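\[
\frac{\det(\Sigma_{\gamma'})}{\det(\Sigma_\gamma)} \leq \left( \frac{n(1+\delta)V^2 + \sigma^2}{n(1-\delta)V^2 + \sigma^2} \right)^{S/2} \tag{4.85}
\]

by Lemma 4.4. The remaining expressions that we must examine have the form

\[
\exp\left( \frac{1}{2} \left( \mu_{\gamma'}^T \Sigma_{\gamma'}^{-2} \mu_{\gamma'} - \mu_\gamma^T \Sigma_\gamma^{-2} \mu_\gamma \right) \right). \tag{4.86}
\]

Since

\begin{align}
\mu_{\gamma'}^T \Sigma_{\gamma'}^{-2} \mu_{\gamma'} - \mu_\gamma^T \Sigma_\gamma^{-2} \mu_\gamma &= \left( \frac{1}{\sigma^2} \Sigma_{\gamma'}^2 X_{\gamma'}^T y \right)^T \Sigma_{\gamma'}^{-2} \left( \frac{1}{\sigma^2} \Sigma_{\gamma'}^2 X_{\gamma'}^T y \right) \tag{4.87} \\
&\quad - \left( \frac{1}{\sigma^2} \Sigma_\gamma^2 X_\gamma^T y \right)^T \Sigma_\gamma^{-2} \left( \frac{1}{\sigma^2} \Sigma_\gamma^2 X_\gamma^T y \right) \tag{4.88} \\
&= \frac{1}{\sigma^2} y^T X_{\gamma'} \left( X_{\gamma'}^T X_{\gamma'} + \frac{\sigma^2}{V^2} I_{|\gamma'| \times |\gamma'|} \right)^{-1} X_{\gamma'}^T y \notag \\
&\quad - \frac{1}{\sigma^2} y^T X_\gamma \left( X_\gamma^T X_\gamma + \frac{\sigma^2}{V^2} I_{|\gamma| \times |\gamma|} \right)^{-1} X_\gamma^T y, \notag
\end{align}

we focus on bounding the expression

\[
y^T \left( X_{\gamma'} \left( X_{\gamma'}^T X_{\gamma'} + \frac{\sigma^2}{V^2} I_{|\gamma'| \times |\gamma'|} \right)^{-1} X_{\gamma'}^T - X_\gamma \left( X_\gamma^T X_\gamma + \frac{\sigma^2}{V^2} I_{|\gamma| \times |\gamma|} \right)^{-1} X_\gamma^T \right) y.
\]

For all $\gamma \in \{0,1\}^p$, let $P_\gamma$ be the orthogonal projection onto $\mathrm{span}\{X_i : \gamma_i = 1\}$. Then

\begin{align}
& X_{\gamma'} \left( X_{\gamma'}^T X_{\gamma'} + \frac{\sigma^2}{V^2} I_{|\gamma'| \times |\gamma'|} \right)^{-1} X_{\gamma'}^T - X_\gamma \left( X_\gamma^T X_\gamma + \frac{\sigma^2}{V^2} I_{|\gamma| \times |\gamma|} \right)^{-1} X_\gamma^T \notag \\
&\qquad \preceq P_{\gamma'} - P_\gamma + \left( P_\gamma - X_\gamma \left( X_\gamma^T X_\gamma + \frac{\sigma^2}{V^2} I_{|\gamma| \times |\gamma|} \right)^{-1} X_\gamma^T \right) \tag{4.89}
\end{align}

in the positive definite ordering. By Lemma 4.3 we have

\[
y^T \left( P_\gamma - X_\gamma \left( X_\gamma^T X_\gamma + \frac{\sigma^2}{V^2} I_{|\gamma| \times |\gamma|} \right)^{-1} X_\gamma^T \right) y \leq \frac{\sigma^2}{n(1-\delta)V^2 + \sigma^2} \|y\|_{\ell_2}^2. \tag{4.90}
\]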

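On the other hand, we may expand

\begin{align}
y^T (P_{\gamma'} - P_\gamma) y &= (X_{\gamma^0}\beta^0_{\gamma^0})^T (P_{\gamma'} - P_\gamma) X_{\gamma^0}\beta^0_{\gamma^0} \tag{4.92} \\
&\quad + 2 (X_{\gamma^0}\beta^0_{\gamma^0})^T (P_{\gamma'} - P_\gamma) \epsilon \tag{4.93} \\
&\quad + \epsilon^T (P_{\gamma'} - P_\gamma) \epsilon. \tag{4.94}
\end{align}
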
By applying Lemmas 4.5, 4.6, and 4.2 to (4.92), (4.93), and (4.94) respectively, 
we obtain the bound 

y T (P Y -P^y < - n (l-5-^)\\j3 lOw n 2 

+4(T i 1 + r~s) V(l + ")5nlogp||^„ N7 ,|| <2 



a 2 
Since $\gamma' \in \{0,1\}^p_S \setminus G$, we may now employ Lemma 4.7 together with the definition of $G$ to obtain the bounds

y T {P^-P^)y < -^fl-S-^\\\^ 0W \\l+4^(l + a)Slogp 
< f-^fl-5- I ^)+4^(l + «)^- 



=-2ct 2 (1+i)) 

Accumulating the bounds we have exhibited thus far, we have 

i (^SyAV - S 7 Vy) < -(1 + ^)^logP + ^ n(1 _ ^ + g2 NIL 
and hence 

det(£ y ) ^-y,-^-^) / n(l + + a 2 \ 5/2 

det(S 7 ) " U(l-^)^ 2 + ^ 2 / 

for all $\gamma^0 \leq \gamma$ and $\gamma' \in \{0,1\}^p_S \setminus G$. Therefore, we may bound (4.83) from below by

1 



1 

> 



1 



1 

> 



/ \ S/2 " y "e 2 

+ \«(l-<5)V 2 +<7 2 y e P V lS7 

1 



using the fact that (5) < ■ This completes the proof. □ 